# Systolic Architecture Design

Lan-Da Van (范倫達), *Ph. D.*Department of Computer Science
National Chiao Tung University
Taiwan, R.O.C. *Fall, 2010* 



Idvan@cs.nctu.edu.tw



http://www.cs.nctu.tw/~ldvan/



### Outline

- Introduction
- Systolic Array Design Methodology
- FIR Systolic Arrays
- Selection of Scheduling Vector
- Conclusion





# Systolic Architecture

- What is systolic architecture (also called Systolic Arrays)?
- A network of PEs that rhythmically compute and pass data through the system.
- Used as a coprocessor in combination with a host computer and the behavior is analogous to the flow of blood through the heart; thus named as systolic.



Lan-Da Van VLSI-DSP-7-3



# Characteristics of Systolic Arrays

- Synchronization
- Modularity
- Regularity
- Locality
- Finite Connection
- Parallel/Pipeline
- Extendibility
- Some relaxations are introduced to increase the utility of systolic arrays
  - Neighbor interconnection ( near, but not nearest )
  - Data broadcast operations
  - Different PEs, especially at the boundaries



Lan-Da Van VLSI-DSP-7-4



### Outline

- Introduction
- Systolic Array Design Methodology
- FIR Systolic Arrays
- Selection of Scheduling Vector
- Conclusion





# Systolic Array Design Methodology





Lan-Da Van VLSI-DSP-7-6



# Design Methodology: Basic Vectors

- Projection vector d<sup>T</sup> = [d<sub>1</sub> d<sub>2</sub>]
  - Determine how DG is compressed.
  - Two nodes that are displaced by d or multiples of d are executed by the same processor
- Processor space vector p<sup>T</sup> = [p<sub>1</sub> p<sub>2</sub>]
  - Any node with index  $I^T = [i, j]$  would be executed by processor  $p^T I$ .
- Schedule vector s<sup>T</sup> = [s<sub>1</sub> s<sub>2</sub>]
  - Any node with index  $I^T = [i, j]$  would be executed at **time s**<sup>T</sup>**I**.
- ♦ Hardware utilization efficiency: HUE = 1/|s<sup>T</sup>d|
  - This is because two tasks executed by the same processor are spaced 1/|s<sup>T</sup>d| time units apart.
- Feasibility constrains
  - Processor space vector and the projection vector must be orthogonal to each other. p is orthogonal to d, that is, p<sup>T</sup>d = 0
    - If A and B differ by projection vector, i.e, I<sub>A</sub>-I<sub>B</sub> = d,
       then they must be executed by the same processor => p<sup>T</sup>I<sub>A</sub> = p<sup>T</sup>I<sub>B</sub> =>p<sup>T</sup>(I<sub>A</sub>-I<sub>B</sub>) = 0
       => p<sup>T</sup>d = 0
  - If A and B are mapped to the same processor, then they cannot be executed at the same time, i.e, s<sup>T</sup>I<sub>A</sub> ≠ s<sup>T</sup>I<sub>B</sub> => s<sup>T</sup>d ≠ 0
  - Edge mapping: If an edge e exists in DG, then an edge p<sup>T</sup>e exists in the systolic array with s<sup>T</sup>e delays.





# Space to Space-Time Representation

- Space-time representation
  - Interpreting one of the spatial dimensions as temporal dimension
  - j': processor axis, t': scheduling time instance

$$\begin{pmatrix} i' \\ j' \\ t' \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 \\ p^T & 0 \\ s^T & 0 \end{pmatrix} \begin{pmatrix} i \\ j \\ t \end{pmatrix}$$

$$i'=t$$
,  $j'=\mathbf{p}^T I$ ,  $t'=\mathbf{s}^T I$ 



Lan-Da Van VLSI-DSP-7-8



### Outline

- Introduction
- Systolic Array Design Methodology
- FIR Systolic Arrays
- Selection of Scheduling Vector
- Matrix-Matrix Multiplication and 2D Systolic Array Design
- Systolic Design for Space Representations Containing Delays
- Conclusion





# Systolic Array Design Methodology







### DG of FIR Filter

- Dependence Graph (DG)
  - Ex: FIR filter:  $y(n) = w_0(n)x(n)+w_1x(n-1)+w_2x(n-2)$





Lan-Da Van VLSI-DSP-7-11



# Systolic Array Design Methodology







### Applying Projection and Scheduling (1/2)

# Processor vector $p^{T} = [0 \ 1]$ Projection vector apply $d^{T} = [1 \ 0]$

Scheduling vector  $s^{T} = [1 \ 0]$ 

#### Part of DG:



$$p^T I = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 2 \end{bmatrix} = 2$$

$$s^T I = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0 \\ 2 \end{bmatrix} =$$



$$p^T I = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = 1$$

$$s^T I = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = 0$$
  $s^T I = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 1$ 



$$p^T I = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0$$

$$s^T I = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0$$



$$p^T I = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 2 \end{bmatrix} = 2$$
  $p^T I = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = 2$ 

$$s^T I = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0 \\ 2 \end{bmatrix} = 0$$
  $s^T I = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = 1$ 



$$p^T I = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} = 1$$
  $p^T I = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 1$ 

$$s^T I = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 1$$



$$p^{T}I = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} = 0 \qquad p^{T}I = \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = 0 \qquad \blacksquare$$

$$s^T I = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = 1$$



processor 2



processor 1



processor 0



SFG



# Applying Projection and Scheduling (2/2)





**Dependence Graph** Applying projection and Scheduling

**Space-time representation** 



Lan-Da Van



# Systolic Array Design Methodology







## Edge Mapping







# Edge Mapping

#### Example:

$$input = e_1 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$$

$$weight = e_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$$

$$output = e_3 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

$$\mathbf{p^T} = [0\ 1]$$





### Edge mapping

| е                          | p <sup>T</sup> e | s <sup>T</sup> e |
|----------------------------|------------------|------------------|
| Input [0 1] <sup>T</sup>   | 1                | 0                |
| Weight [1 0] <sup>T</sup>  | 0                | 1                |
| Output [1 -1] <sup>T</sup> | -1               | 1                |

Edge mapping table







# Systolic Array Design Methodology







# Construct the Final Systolic Architecture







## Alternative Designs

- ◆ B<sub>1</sub> (Broadcast inputs, Move results, Weight Stay)
- B<sub>2</sub> (Broadcast inputs, Move Weight, Results stay)
- F (Fan-in results, Move inputs, Weight stay)
- R<sub>1</sub> (Results stay, Inputs and Weight move in opposite directions)
- R<sub>2</sub> and Dual R<sub>2</sub> (Results stay, Inputs and Weights move in the same direction but at different speeds)
- W<sub>1</sub> (Weights stay, Inputs and Results move in opposite directions)
- W<sub>2</sub> and Dual W<sub>2</sub> (Weights stay, Inputs and Results move in same direction but at different speeds)
- Relating systolic designs using transformations





# B<sub>2</sub> – Broadcast Inputs, Move Weight, Results Stay



| е                          | р <sup>т</sup> е | s <sup>T</sup> e |
|----------------------------|------------------|------------------|
| wt [1 0] <sup>⊤</sup>      | 1                | 1                |
| input [0 1] <sup>⊤</sup>   | 1                | 0                |
| result [1 -1] <sup>⊤</sup> | 0                | 1                |

$$HUE = \frac{1}{\left| s^T d \right|} = 1$$







# F - Fan-in Results, Move Inputs, Weight Stay



| е                          | р <sup>т</sup> е | s <sup>T</sup> e |
|----------------------------|------------------|------------------|
| wt [1 0] <sup>⊤</sup>      | 0                | 1                |
| input [0 1] <sup>⊤</sup>   | 1                | 1                |
| result [1 -1] <sup>⊤</sup> | -1               | 0                |

$$HUE = \frac{1}{\left|S^T d\right|} = 1$$







Lan-Da Van

VLSI-DSP-7-23



# R<sub>1</sub> - Results Stay, Inputs and Weight Move in Opposite Directions



| е                          | р <sup>т</sup> е | s <sup>T</sup> e |
|----------------------------|------------------|------------------|
| wt [1 0] <sup>T</sup>      | 1                | 1                |
| input [0 1] <sup>T</sup>   | -1               | 1                |
| result [1 -1] <sup>⊤</sup> | 0                | 2                |

$$HUE = \frac{1}{\left|s^{T}d\right|} = \frac{1}{2}$$

I-DSP-7-24







# R<sub>2</sub> and Dual R<sub>2</sub>-Results Stay, Inputs and Weights Move in the Same Direction but at Different Speeds







| HUE = | 1                              | _ 1 |
|-------|--------------------------------|-----|
| HUL – | $\overline{\left s^Td\right }$ | — J |

result





# W<sub>1</sub> – Weights Stay, Inputs and Results Move in Opposite Directions



$$HUE = \frac{1}{\left|s^{T}d\right|} = \frac{1}{2}$$







# W<sub>2</sub> and Dual W<sub>2</sub>-Weights Stay, Inputs and Results Move in Same Direction but at Different Speeds





| е                          | р <sup>т</sup> е | s <sup>T</sup> e |
|----------------------------|------------------|------------------|
| <br>wt [1 0] <sup>⊤</sup>  | 0                | 1                |
| input [0 1] <sup>T</sup>   | -1               | 1                |
| result [1 -1] <sup>⊤</sup> | -1               | 2                |

$$HUE = \frac{1}{\left| s^T d \right|} = 1$$





### Relating Systolic Designs Using Transformations

- The same projection vector and processor space vector
- Different scheduling vectors
- Can derive each other using transformations
  - Edge reversal: reverse edge direction in DG when no precedence constraints
  - Associativity: when accumulating (a+b)+c = a+(b+c)
  - Slow-down
  - Retiming
  - Pipelining





# **Cutset Retiming Transformation**







### Outline

- Introduction
- Systolic array design methodology
- FIR systolic arrays
- Selection of scheduling vector
- Conclusion





# Scheduling Inequalities (1/3)

Based on selected scheduling vector s<sup>T</sup>, the projection vector d and the processor space vector p<sup>T</sup> can be selected.

$$\mathbf{p}^{T}(\mathbf{I}_{A} - \mathbf{I}_{B}) = 0 \Longrightarrow \mathbf{p}^{T}\mathbf{d} = 0$$

$$\mathbf{s}^{T}\mathbf{I}_{A} \neq \mathbf{s}^{T}\mathbf{I}_{B} \Longrightarrow \mathbf{s}^{T}\mathbf{d} \neq 0$$

Consider the dependence relation X -> Y,

$$X: \mathbf{I}_{x} = \begin{pmatrix} i_{x} \\ j_{x} \end{pmatrix} \longrightarrow Y: \mathbf{I}_{y} = \begin{pmatrix} i_{y} \\ j_{y} \end{pmatrix}$$

where I<sub>x</sub> and I<sub>y</sub> are the indices of node X and node Y, respectively. The scheduling inequality for this dependence is defined as



# Scheduling Inequalities (2/3)

$$S_{v} \ge S_{x} + T_{x}$$
 Eq. (1)

Where  $T_x$  is the time to compute node X and  $S_x$ ,  $S_y$  are the scheduling times for nodes X, Y, respectively.

Linear scheduling 
$$S_x = s^T I_x = (s_1 \quad s_2) {i_x \choose j_x}$$
 Eq. (2)

$$S_y = s^T I_y = (s_1 \quad s_2) inom{i_y}{j_y}$$
 Affine scheduling

$$(s_2) \left( egin{array}{c} i_y \ j_y \end{array} 
ight)$$

$$S_{x} = S^{T} I_{x} + \gamma_{x} = (s_{1} \quad s_{2}) \begin{pmatrix} i_{x} \\ j_{x} \end{pmatrix} + \gamma_{x}$$

$$S_{y} = S^{T} I_{y} + \gamma_{y} = (s_{1} \quad s_{2}) \begin{pmatrix} i_{y} \\ j_{y} \end{pmatrix} + \gamma_{y}$$

Eq. (3)



# Scheduling Inequalities (3/3)

Define the edge from node X to node Y as

$$\mathbf{e}_{x-y} = \mathbf{I}_y - \mathbf{I}_x$$

Eqs. (1) & (2) 
$$=> \mathbf{s}^T \mathbf{e}_{x-y} + r_y - r_x \ge T_x$$

- Hence the selection of scheduling vector consists of two steps:
  - Capture all the fundamental edges. The reduced dependence graph (RDG) is used to capture the fundamental edges and the regular iterative algorithm (RIA) description of the corresponding problem is used to construct RDGs.
  - Construct the scheduling inequalities and solve them for feasible s<sup>T</sup>.





# Regular Iterative Algorithm (RIA)

- The regular iterative algorithm is the method for constructing the reduce dependence graph (RDG).
- The regular iterative algorithm (RIA) has two standard forms:
  - The RIA is in standard input RIA form if the index of the inputs are the same for all equations.
  - The RIA is in standard output RIA form if output indices are the same for all equations.
- FIR example:

$$W(i+1, j) = W(i, j)$$
  
 $X(i, j+1) = X(i, j)$   
 $Y(i+1, j-1) = Y(i, j)$   
 $+W(i+1, j-1)X(i+1, j-1)$ 

Output RIA Form

$$W(i, j) = W(i-1, j)$$
  
 $X(i, j) = X(i, j-1)$   
 $Y(i, j) = Y(i-1, j+1)$   
 $+W(i, j)X(i, j)$ 



### Scheduling Vector and Systolic Array Design Using RDG

- Constructing scheduling inequalities using RDG
- Determine the scheduling vector using scheduling inequalities
- Systolic mapping using the scheduling vector
- This formulation can accommodate different computation times for various operations due to its generality.





# Example 7.4.1 (1/4)

Example 7.4.1 Assume that the time to perform multiplication, addition, and communication are as follows:

$$T_{mult} = 5, \quad T_{add} = 2, \quad T_{com} = 1.$$

Recall the scheduling inequality for an edge in a DG is given by:

$$\mathbf{s}^T \mathbf{e} + \gamma_y - \gamma_x \ge T_x$$

where

$$\mathbf{s} = \left( \begin{array}{c} s_1 \\ s_2 \end{array} \right).$$

There are 5 edges in the above RDG.





## Example 7.4.1 (2/4)

### Reduced Dependence Graph (RDG)

$$W \to Y : e = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \gamma_y - \gamma_x \ge 0$$

$$X \to X : e = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, s_2 + \gamma_x - \gamma_x \ge 1$$

$$W \to W : e = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, s_1 + \gamma_w - \gamma_w \ge 1$$

$$X \to Y : e = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \gamma_y - \gamma_x \ge 0$$

$$Y \to Y : e = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, s_1 - s_2 + \gamma_y - \gamma_y \ge 5 + 2 + 1$$





## Example 7.4.1 (3/4)

For linear scheduling,  $\gamma_x = \gamma_y = \gamma_w = 0$ . Simplifying these equations, we have

$$s_1 \ge 1$$
,  $s_2 \ge 1$ ,  $s_1 - s_2 \ge 8$ .

Therefore, one of the solutions is

$$s_2 = 1, s_1 = 8 + 1 = 9 \Rightarrow \mathbf{s}^T = (9, 1).$$

Now, select  $\mathbf{d} = (1, -1)$  such that  $\mathbf{s}^T \mathbf{d} \neq 0$  and select  $\mathbf{p}^T$  such that  $\mathbf{p}^T \mathbf{d} = 0$ . Choose  $\mathbf{p}^T = (1, 1)$ . Since

$$\mathbf{s}^T\mathbf{d} = \begin{pmatrix} 9 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix} = 8,$$

therefore HUE = 1/8.





# Example 7.4.1 (4/4)

### Linear scheduling

$$s_1 \ge 1, s_2 \ge 1, s_1 - s_2 \ge 8$$
  
 $s_2 = 1, s_1 = 9 \rightarrow s^T = \begin{pmatrix} 9 & 1 \end{pmatrix}$   
 $d = (1, -1), p^T = (1, 1)$ 

| е            | р <sup>т</sup> е | s <sup>T</sup> e |
|--------------|------------------|------------------|
| wt(1,0)      | 1                | 9                |
| i/p(0,1)     | 1                | 1                |
| Result(1,-1) | 0                | 8                |

### Systolic array architecture





### Conclusion

- Systolic architecture
  - A massively parallel processing with limited I/O communication with host computer
  - Suitable for many regular interactive operations
- Design methodology
  - Map an N-dimensional DG to (N-1) dimensional space-time representation
  - Needs to determine three critical vectors
    - Projection vector
    - Processor space vector
    - Scheduling vector

